Automatic Acquisition of Hyponyms from Large Text Corpora

Author

  • Marti A. Hearst
Abstract

We describe a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. Two goals motivate the approach: (i) avoidance of the need for pre-encoded knowledge and (ii) applicability across a wide range of text. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of interest. We describe a method for discovering these patterns and suggest that other lexical relations will also be acquirable in this way. A subset of the acquisition algorithm is implemented and the results are used to augment and critique the structure of a large hand-built thesaurus. Extensions and applications to areas such as information retrieval are suggested.

1 Introduction

Currently there is much interest in the automatic acquisition of lexical syntax and semantics, with the goal of building up large lexicons for natural language processing. Projects that center around extracting lexical information from Machine Readable Dictionaries (MRDs) have shown much success but are inherently limited, since the set of entries within a dictionary is fixed. In order to find terms and expressions that are not defined in MRDs we must turn to other textual resources. For this purpose, we view a text corpus not only as a source of information, but also as a source of information about the language it is written in.

When interpreting unrestricted, domain-independent text, it is difficult to determine in advance what kind of information will be encountered and how it will be expressed. Instead of interpreting everything in the text in great detail, we can search for specific lexical relations that are expressed in well-known ways. Surprisingly useful information can be found with only a very simple understanding of a text. Consider the following sentence:[1]
(S1) The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.

Most fluent readers of English who have never before encountered the term "Bambara ndang" will nevertheless from this sentence infer that a "Bambara ndang" is a kind of "bow lute". This is true even if the reader has only a fuzzy conception of what a bow lute is. Note that the author of the sentence is not deliberately defining the term, as would a dictionary or a children's book containing a didactic sentence like A Bambara ndang is a kind of bow lute. However, the semantics of the lexico-syntactic construction indicated by the pattern:

(1a) $NP_0$ such as {$NP_1$, $NP_2$, ..., (and | or)} $NP_n$

are such that they imply

(1b) for all $NP_i$, $1 \le i \le n$: hyponym($NP_i$, $NP_0$)

Thus from sentence (S1) we conclude hyponym("Bambara ndang", "bow lute").

We use the term hyponym similarly to the sense used in (Miller et al. 1990): a concept represented by a lexical item $L_0$ is said to be a hyponym of the concept represented by a lexical item $L_1$ if native speakers of English accept sentences constructed from the frame An $L_0$ is a (kind of) $L_1$. Here $L_1$ is the hypernym of $L_0$ and the relationship is reflexive and transitive, but not symmetric.

This example shows a way to discover a hyponymic lexical relationship between two or more noun phrases in a naturally-occurring text. This approach is similar in spirit to the pattern-based interpretation techniques being used in MRD processing. For example, (Alshawi 1987), in interpreting LDOCE definitions, uses a hierarchy of patterns which consist mainly of part-of-speech indicators and wildcard characters. (Markowitz et al. 1986), (Jensen & Binot 1987), and (Nakamura & Nagao 1988) also use pattern recognition to extract semantic relations such as taxonomy from various dictionaries. (Ahlswede & Evens 1988) compares an approach based on parsing Webster's 7th definitions with one based on pattern recognition, and finds that for finding simple semantic relations, pattern recognition is far more accurate and efficient than parsing. The general feeling is that the structure and function of MRDs makes their interpretation amenable to pattern-recognition techniques.

[1] All examples in this paper are real text, taken from Grolier's American Academic Encyclopedia (Grolier 1990).

Actes de COLING-92, Nantes, 23-28 août 1992 / Proc. of COLING-92, Nantes, Aug. 23-28, 1992

Thus one could say that by interpreting sentence (S1) according to (1a-b) we are applying pattern-based relation recognition to general texts. Since one of the goals of building a lexical hierarchy automatically is to aid in the construction of a natural language processing program, this approach to acquisition is preferable to one that needs a complex parser and knowledge base. The tradeoff is that the information acquired is coarse-grained.

There are many ways that the structure of a language can indicate the meanings of lexical items, but the difficulty lies in finding constructions that frequently and reliably indicate the relation of interest. It might seem that, because free text is so varied in form and content (as compared with the somewhat regular structure of the dictionary), it may not be possible to find such constructions. However, we have identified a set of lexico-syntactic patterns, including the one shown in (1a) above, that indicate the hyponymy relation and that satisfy the following desiderata:

(i) They occur frequently and in many text genres.
(ii) They (almost) always indicate the relation of interest.
(iii) They can be recognized with little or no pre-encoded knowledge.
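As an illustration of how a construction like (1a) can be recognized with little pre-encoded knowledge, the following is a minimal Python sketch, not the implementation described in the paper. It assumes that noun phrases have already been bracketed by some upstream chunker (the `[...]` input format is an assumption made here for illustration), and the names `PATTERN_1A`, `strip_determiner`, and `extract_such_as` are hypothetical. Pairs are returned in the (hyponym, hypernym) order of (1b).

```python
import re

# ASSUMPTION: noun phrases are pre-bracketed, e.g. "[the bow lute]".
# The paper does not prescribe this input format.
NP = r"\[[^\]]+\]"
# Separator between list items: a comma (optionally followed by and/or),
# or a bare "and"/"or".
SEP = r"(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"

# (1a): NP_0 such as {NP_1, NP_2, ..., (and | or)} NP_n
PATTERN_1A = re.compile(
    rf"(?P<hypernym>{NP})\s*,?\s*such as\s+(?P<hyponyms>{NP}(?:{SEP}{NP})*)"
)


def strip_determiner(np: str) -> str:
    """Drop a leading article so that 'the bow lute' becomes 'bow lute'."""
    return re.sub(r"^(?:the|a|an)\s+", "", np.strip(), flags=re.IGNORECASE)


def extract_such_as(chunked: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs licensed by (1a) and (1b)."""
    pairs = []
    for match in PATTERN_1A.finditer(chunked):
        hypernym = strip_determiner(match.group("hypernym")[1:-1])
        for np in re.findall(r"\[([^\]]+)\]", match.group("hyponyms")):
            pairs.append((strip_determiner(np), hypernym))
    return pairs


# Sentence (S1), with noun phrases bracketed by hand for this sketch.
s1 = ("[The bow lute] , such as [the Bambara ndang] , is plucked and has "
      "[an individual curved neck] for [each string] .")
print(extract_such_as(s1))  # [('Bambara ndang', 'bow lute')]
```

The list of hyponyms ends at the first noun phrase that is not followed by a comma or conjunction plus another noun phrase, which is why "an individual curved neck" is not picked up in (S1).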
Item (i) indicates that the pattern will result in the discovery of many instances of the relation, item (ii) that the information extracted will not be erroneous, and item (iii) that making use of the pattern does not require the tools that it is intended to help build.

Finding instances of the hyponymy relation is useful for several purposes:

Lexicon Augmentation. Hyponymy relations can be used to augment and verify existing lexicons, including ones built from MRDs. Section 3 of this paper describes an example, comparing results extracted from a text corpus with information stored in the noun hierarchy of WordNet (Miller et al. 1990), a hand-built lexical thesaurus.

Noun Phrase Semantics. Another purpose to which these relations can be applied is the identification of the general meaning of unfamiliar noun phrases. For example, discovering the predicate hyponym("broken bone", "injury") indicates that the term "broken bone" can be understood at some level as an "injury" without having to determine the correct senses of the component words and how they combine. Note also that a term like "broken bone" is not likely to appear in a dictionary or lexicon, although it is a common locution.

Semantic Relatedness Information. There has recently been work in the detection of semantically related nouns via, for example, shared argument structures (Hindle 1990), and shared dictionary definition context (Wilks et al. 1990). These approaches attempt to infer relationships among lexical terms by looking at very large text samples and determining which ones are related in a statistically significant way. The technique introduced in this paper can be seen as having a similar goal but an entirely different approach, since only one sample need be found in order to determine a salient relationship (and that sample may be infrequently occurring or nonexistent).
Thinking of the relations discovered as closely related semantically instead of as hyponymic is most felicitous when the noun phrases involved are modified and atypical. Consider, for example, the predicate

hyponym("detonating explosive", "blasting agent")

This relation may not be a canonical ISA relation but the fact that it was found in a text implies that the terms' meanings are close. Connecting terms whose expressions are quite disparate but whose meanings are similar should be useful for improved synonym expansion in information retrieval and for finding chains of semantically related phrases, as used in the approach to recognition of topic boundaries of (Morris & Hirst 1991). We observe that terms that occur in a list are often related semantically, whether they occur in a hyponymy relation or not.

In the next section we outline a way to discover these lexico-syntactic patterns as well as illustrate those we have found. Section 3 shows the results of searching texts for a restricted version of one of the patterns and compares the results against a hand-built thesaurus. Section 4 is a discussion of the merits of this work and describes future directions.

2 Lexico-Syntactic Patterns for Hyponymy

Since only a subset of the possible instances of the hyponymy relation will appear in a particular form, we need to make use of as many patterns as possible. Below is a list of lexico-syntactic patterns that indicate the hyponymy relation, followed by illustrative sentence fragments and the predicates that can be derived from them (detail about the environment surrounding the patterns is omitted for simplicity):

(2) such NP as {NP ,}* {(or | and)} NP

... works by such authors as Herrick, Goldsmith, and Shakespeare.

⇒ hyponym("author", "Herrick"), hyponym("author", "Goldsmith"), hyponym("author", "Shakespeare")

(3) NP {, NP}* {,} or other NP

Bruises, wounds, broken bones or other
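Pattern (2) above can be matched in the same spirit. The following is a minimal Python sketch under stated assumptions rather than the paper's implementation: noun phrases are assumed to be pre-bracketed by an upstream chunker (the `[...]` format is an illustrative assumption), and `PATTERN_2` and `extract_such_np_as` are hypothetical names. The sketch returns pairs in the (hyponym, hypernym) order of (1b) and does not lemmatize, so it yields "authors" where the predicates above give "author".

```python
import re

# ASSUMPTION: noun phrases are pre-bracketed, e.g. "[authors]".
NP = r"\[[^\]]+\]"
# Separator between list items: a comma (optionally followed by and/or),
# or a bare "and"/"or".
SEP = r"(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"

# (2): such NP as {NP ,}* {(or | and)} NP
PATTERN_2 = re.compile(
    rf"such\s+(?P<hypernym>{NP})\s+as\s+(?P<hyponyms>{NP}(?:{SEP}{NP})*)"
)


def extract_such_np_as(chunked: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs for each match of pattern (2)."""
    pairs = []
    for match in PATTERN_2.finditer(chunked):
        hypernym = match.group("hypernym")[1:-1].strip()
        for np in re.findall(r"\[([^\]]+)\]", match.group("hyponyms")):
            pairs.append((np.strip(), hypernym))
    return pairs


# The fragment illustrating pattern (2), bracketed by hand for this sketch.
fragment = ("works by such [authors] as [Herrick] , [Goldsmith] , "
            "and [Shakespeare] .")
for hyponym, hypernym in extract_such_np_as(fragment):
    print(f'hyponym("{hyponym}", "{hypernym}")')
```

Mapping "authors" to "author" would need a lemmatization step, which is left out here to keep the sketch free of external resources.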


Similar resources

Automatic Acquisition of Hyponyms and Meronyms from Question Corpora

We explore how lexical and ontological relations can be acquired automatically from natural language questions. The focus in this paper is on identifying hyponym and meronym relations by using simple pattern matching. It is shown that natural language questions can provide a significant source for ontological information.


Acquisition of Hypernyms and Hyponyms from the WWW

Recently research in automatic ontology construction has become a hot topic, because of the vision that ontology will be the core component to realize the semantic web. This paper presents a method to automatically construct ontology by mining the web. We introduce an algorithm to automatically acquire hypernyms and hyponyms for any given lexical term using search engine and natural language pr...


Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containin...


Using WordNet Lexical Database and Internet to Disambiguate Word Senses

The term “knowledge acquisition bottleneck” has been used in Word Sense Disambiguation Tasks (WSDTs) to illustrate/express the problem of the lack of large tagged corpora. In this paper, an automated WSDT is based on text corpora extracted / collected from Internet web pages. First, the disambiguation for the sense of a word, in a context, is based on the use of its definition and the definitio...


Contextual Meta-Knowledge Acquisition from Corpora

This paper looks at the area of automatic acquisition of meta-knowledge for the structuring of very large knowledge bases (VLKB). It is argued that we will rediscover the need in Natural Language Processing (NLP) for such large knowledge bases and that one possible method for structuring them efficiently lies in association-based statistics gathered from corpora. The discussion sets out the aims ...


Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus

Fully Automatic Thesaurus Generation (ATG) seeks to generate useful thesauri by mining a corpus of raw text. A number of statistical approaches, based on term co-occurrence, exist for this, but in general they are only able to estimate the strength of the relationship between two terms, not its nature. In this paper we implement Hearst's method of discovering the hyponymy relations which are t...



Publication date: 1992